(54) Method of using primary and secondary processors

(57) The invention relates to the compilation of 
source code to a primary and a secondary processor. It
relates to reconfigurable secondary processors, and is
especially relevant to secondary processors which can
be reconfigured to some degree during execution of
code. Selective extraction of dataflows from the source
code is followed by transformation of the extracted data- 
flows into trees. The trees are then matched against 
each other to determine minimum edit cost relation- 
ships for transformation of one tree into another, where 
these minimum edit cost relationships are determined 
by the architecture of the secondary processor. A group 
or a plurality of groups of dataflows is determined on the 
basis of said minimum edit cost relationships and for 
each group a generic dataflow capable of supporting 
each dataflow in that group is created. The generic 
dataflow or dataflows is then used to determine the 
hardware configuration of the secondary processor; 
and calls to the secondary processor for said group or 
plurality of groups of dataflows are substituted into the 
source code. The resultant source code is compiled to 
the primary processor. 

The resulting efficient configuration thus reduces 
either the expense of reconfiguration (in a field program- 
mable array), or the silicon area (in an application spe- 
cific integrated circuit). 
Description



[0001] The present invention relates to the compilation and execution of source code for a processor architecture
consisting of a primary processor and one (or more) secondary processors. The invention is particularly, though not
exclusively, relevant to architectures employing a [...]

[0002] A primary processor - such as a Pentium processor in a conventional PC (Pentium is a trade mark of Intel
Corporation) - has evolved to be versatile, in that it is adapted to handle a wide range of computational tasks without
being optimised for any of them. Such a processor is thus not optimised to handle efficiently computationally intensive
operations, such as parallel sub-word tasks. Such tasks can cause significant bottlenecks in the execution of code.

[0003] An approach taken to solve this problem is the development of integrated circuits specifically adapted for
particular applications. These are known as ASICs, or application-specific integrated circuits. Tasks for which such an
ASIC is adapted are generally performed very well; however, the ASIC will generally perform poorly, if at all, on tasks
for which it is not configured. Clearly, a specific IC can be built for a particular application, but this is not a desirable
solution for applications that are not central to the operation of a computer, or are not yet determined at the time of
building the computer. It is thus particularly advantageous for an ASIC to be reconfigurable, so that it can be optimised
for different applications as required. The commonest form of architecture for such devices is the field programmable
gate array (FPGA), a fine-grained processor structure which can be configured to have a structure which is suited to any
given application. Such structures can be used as independent processors in suitable contexts, but are also particularly
appropriate to use as coprocessors.

[0004] Such configurable coprocessors [...] particular tasks, code run inefficiently by the primary processor can be
extracted and run more efficiently in an adapted coprocessor which has been optimised for that application. With
continued development of such "application-specific" secondary processors, the possibility of improving performance by
extracting difficult code to a custom coprocessor becomes more attractive. A particularly important example in general
computing is the extraction of loop bodies in image handling.

[0005] To obtain the desired efficiency gains, it is necessary to determine as effectively as possible how code is to be
divided between primary and secondary processors, and to configure the secondary processor for optimal execution of
its assigned part of the code. One approach is to mark the code appropriately on its creation for mapping to coprocessor
structures. In "A C++ compiler for FPGA custom execution units synthesis", Christian Iseli and Eduardo Sanchez,
IEEE Symposium on FPGAs for Custom Computing Machines, Napa, California, April 1995, an approach is employed
which involves mapping of C++ to FPGAs in VLIW (Very-Long Instruction Word) structures after appropriate tagging
of the initial code by the programmer. This approach relies on the initial programmer making a good choice of code to
extract initially.

[0006] An alternative approach is to assess the initial code to determine which elements will be the most appropriate to
direct to the secondary processor. "Two-Level Hardware/Software Partitioning Using CoDe-X", Reiner W. Hartenstein,
Jürgen Becker and Rainer Kress, in Int. IEEE Symposium on Engineering of Computer Based Systems (ECBS),
Friedrichshafen, Germany, March 1996, discusses a codesign tool which incorporates a profiler to assess which parts of
an initial code are suitable for allocation to a coprocessor and which should be reserved for the primary processor. This
is followed by an iterative procedure allowing for compilation of a subset of C code to a reconfigurable coprocessor
architecture so that the extracted code can be mapped to the coprocessor. This approach does expand the usage of
secondary processors, but does not fully realize the potential of reconfigurable logic.

[0007] Comparable approaches have been proposed in the BRASS research project at the University of Berkeley. An
approach discussed in "Datapath-Oriented FPGA Mapping and Placement", Tim Callahan & John Wawrzynek, a poster
presented at FCCM '97, Symposium on Field-Programmable Custom Computing Machines, April 16-18 1997, Napa
Valley, California (currently available on the World Wide Web at http://www.cs.berkeley.edu/projects/brass/[...]) uses
template structures representative of an FPGA architecture to assist in the mapping of source code on to FPGA
structures. Source code samples are rendered as directed acyclic graphs, or DAGs, and then reduced to trees. These and
other basic graph concepts are set out, for example, in "High Performance Compilers for Parallel Computing", Michael
Wolfe, pages 49 to 56, Addison-Wesley, Redwood City 1996, but a brief definition of a DAG and a tree follows here.

[0008] A graph consists of a set of nodes, and a set of edges: each edge is defined by a pair of nodes (and can be
considered graphically as a line joining those nodes). A graph can be either directed or undirected: in a directed graph
each edge has a direction. If it is possible to define a path within a graph from one node back to itself, then the graph is
cyclic: if not, then the graph is acyclic. A DAG is a graph that is both directed and acyclic: it is thus a hierarchical
structure. A tree is a specific kind of DAG. A tree has a single source node, termed "root", and there is a unique path
from root to every other node in the tree. If there is an edge X→Y in a tree, then node X is termed the parent of Y, and Y
is termed the child of X. In a tree, a "parent node" has one or more "child nodes", but a child node can have only one
parent, whereas in a general DAG, a child can have more than one parent. Nodes of a tree with no children are termed
leaf nodes.
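
The definitions above can be made concrete with a short Python sketch. It is an illustration only and not part of the
described method: the function names and the representation of a graph as a list of (parent, child) pairs are
assumptions made for this example.

```python
from collections import defaultdict


def is_acyclic(nodes, edges):
    """Return True if no path leads from a node back to itself."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(n):
        if state[n] == 1:   # reached a node that is still being explored: cycle
            return False
        if state[n] == 2:
            return True
        state[n] = 1
        ok = all(visit(c) for c in children[n])
        state[n] = 2
        return ok

    return all(visit(n) for n in nodes)


def is_tree(nodes, edges):
    """A tree is a DAG with one root in which every other node has exactly one parent."""
    parent_count = {n: 0 for n in nodes}
    for _, child in edges:
        parent_count[child] += 1
    roots = [n for n, k in parent_count.items() if k == 0]
    return (is_acyclic(nodes, edges)
            and len(roots) == 1
            and all(k == 1 for n, k in parent_count.items() if n != roots[0]))


nodes = ["X", "Y", "Z", "W"]
diamond = [("X", "Y"), ("X", "Z"), ("Y", "W"), ("Z", "W")]  # W has two parents
simple_tree = [("X", "Y"), ("X", "Z"), ("Y", "W")]
print(is_acyclic(nodes, diamond), is_tree(nodes, diamond))  # True False
print(is_tree(nodes, simple_tree))                          # True
```

The "diamond" graph is a DAG but not a tree, because node W has two parents; removing one of the two edges into W
yields a tree.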

[0009] In the work of Tim Callahan & John Wawrzynek [...] "tree covering" program called iburg. iburg is a generally
available software tool, and its application is described in "A Retargetable C Compiler: Design and Implementation",
Christopher W. Fraser and David R. Hanson, Benjamin/Cummings Publishing Co. Inc., Redwood City, 1995, especially at
pp 373-407. iburg takes as input the source code trees and partitions this input into chunks that correspond to
instructions on the target processor. This partition is termed a tree cover. This approach [...] complex; it involves a
bottom-up matching of a tree with patterns, recording all possible matches, followed by a top-down reduction pass to
determine which match of patterns provides the lowest cost. Again, this approach requires a significant initial constraint
in the form of the predefined set of allowable patterns, and does not fully realize the possibilities of a reconfigurable
architecture.

[0010] There is thus a need to develop techniques and approaches to further improve [...] systems involving a primary
and secondary processor, by which an optimal choice can be made for allocation of code to a secondary processor, which
can then be configured as efficiently as possible to run the extracted code, with a view to maximising the performance
efficiency of the primary and secondary processor system in execution of input code.
[0011] Accordingly, the invention provides a method of compiling source code to a primary and a secondary processor,
comprising: selective extraction of dataflows from the source code; transformation of the extracted dataflows into
trees; matching of the trees against each other to determine minimum edit cost relationships for transformation of one
tree into another; determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost
relationships and creating for each group a generic dataflow capable of supporting each dataflow in that group; using the
generic dataflow or dataflows to determine the hardware configuration of the secondary processor; and substituting into
the source code calls to the secondary processor for said group or plurality of groups of dataflows, and compiling the
resultant source code to the primary processor.

[0012] This approach allows for optimal selection of source code dataflows for allocation to the secondary processor
without prejudgement of suitability (by, for example, mapping onto predetermined templates) but while still taking full
account of the demands and requirements [...] cost relationships are determined according to the architecture of the
secondary processor, and represent a hardware cost of a corresponding reconfiguration of the secondary processor. The
method is particularly effective if the minimum edit cost relationships are embodied in a taxonomy of minimum [...]
[0013] The method finds its most useful application where the hardware configuration of the secondary processor
allows for reconfiguration of the secondary processor during execution of the source code, as this allows for
reconfiguration of the secondary processor to be required during execution of the source code to support each dataflow
in the group supported by a generic dataflow. The secondary processor may thus be an application specific instruction
processor, and the processor hardware may be a field programmable gate array or a field programmable arithmetic array
(such as that shown in the CHESS architecture discussed in Appendix A).

[0014] Advantageously, the generic dataflow of a group is calculated by an approximate mapping of dataflows in the 

group on to each other, followed by a merge operation. 

[0015] An advantageous approach to construction of a generic dataflow is to provide the dataflows as directed acyclical
graphs and reduce them to trees by removal of any links in the directed acyclical graphs not present in a critical
path between a leaf node and the root of a [...] which passes through the largest number of intermediate nodes.
Alternative criteria to the critical path can be adopted if more appropriate to the secondary processor hardware (for
example, if a different criterion can be found which is more sensitive to the timing of operations in the secondary
processor).

[0016] An advantageous further step can be taken after the creation of a generic dataflow, in which the generic dataflow
is compared with further dataflows extracted from the source code, wherein those of said further dataflows which
match the generic dataflow sufficiently closely are added to the generic dataflow. This enables more or all of the code
present in the source code which is suitable for allocation to the secondary processor to be so allocated.
[0017] In the approaches indicated above, the removed links are stored after the directed acyclical graphs are
reduced to trees and are reinserted into the generic dataflow after the merging of the trees of the group into the generic
dataflow.

[0018] Specific embodiments of the invention are described below, by way of example, with reference to the
accompanying drawings, of which:

Figure 1 shows a general purpose computer architecture to which embodiments of the invention can suitably be
applied;

Figure 2 shows schematically a method of compiling source code to a primary and a secondary processor
according to an embodiment of the invention;

Figure 3 illustrates a step of conversion of a DAG to a tree employed in a method step according to one
embodiment of the invention;

Figure 4a illustrates the step of insertion and deletion of nodes and Figure 4b illustrates the step of substitution of
nodes in a tree matching process employed in a method step according to an embodiment of the invention;

Figure 5 shows an edit distance taxonomy provided in an example according to an embodiment of the invention;

Figure 6 illustrates a generic dataflow provided in an example according to one embodiment of the invention;

Figure 7 shows a logical interface for allocation of secondary processor resources for a generic dataflow according
to an embodiment of the invention;

Figure 8 shows the application of DAGs to dataflows including multiplexers to handle conditional statements; and

Figures 9a to 9d show an illustration of the merging of candidate dataflows to form a generic dataflow in an example
according to an embodiment of the invention.

[0019] The present invention is adapted for compilation of source code to an architecture comprising a primary and
a secondary processor. An example of such an architecture is shown in Figure 1. The primary processor 1 is a
conventional general-purpose processor, such as a Pentium II processor of a personal computer. Receiving calls from the
primary processor 1 and returning responses to it are secondary processors 2 and (optionally) 4. Each secondary
processor 2, 4 is adapted to increase the computational power and efficiency of the architecture by handling parts of the
source code not well handled by the primary processor 1. Secondary processor 4, optionally present here, is a dedicated
coprocessor adapted to handle a specific function (such as JPEG, DSP or the like) - the structure of this coprocessor 4
will be determined by a manufacturer to handle a specific frequently used function. Such coprocessors 4 are
not the specific subject of the present application. By contrast, the secondary processor 2 is not already optimised for
a specific function, but is instead configurable to enable improved handling of parts of the source code not well handled
by the primary processor. The secondary processor 2 is advantageously an application specific structure: it can be a
conventional FPGA, such as the Xilinx 4013 or any other member of the Xilinx 4000 series. An alternative class of
reconfigurable device, referred to as a field programmable arithmetic array, is described in Appendix A hereto. Such a
secondary processor can be configured for high computational efficiency in handling desired parts of the source code
for an application to be executed by the architecture.

[0020] Also employed in the computer architecture are memory 3, accessed by the primary processor 1 and, for
appropriate types of secondary processor 2, by the secondary processor 2, and input/output channel 5. Input/output
channel 5 here represents all further channels and hardware necessary to enable the user to [...] processors (for
example, by programming) and to allow the processors to interact with other parts of the [...]

[0021] The present invention is particularly relevant to the optimised partitioning of source code between primary 
processor 1 and secondary processor 2, which allows for optimal configuration of secondary processor 2 to optimise 
the handling of the application embodied in the source code by the architecture. A significant contribution is made by 
the invention in the selection and extraction of code for use in the secondary processor. 

[0022] The approach taken, according to an embodiment of the invention, is set out in Figure 2. The initial input to the
process is a body of source code. In principle, this can be in any language: the example described was carried out on
C code, but the person skilled in the art will readily understand how the techniques described could be adopted with
other languages. For example, the source code could be Java byte code: if Java byte code could be so handled, the
architecture of Figure 1 could be particularly well adapted to directly receiving and executing source code received from
the internet.

[0023] As can be seen from Figure 2, the first step in the process is the identification of appropriate candidate code
to be executed by the secondary processor 2. Typically, this is done by performing dataflow analysis on the source code
and building appropriate representations of the dataflows presented by selected lines of code (in most processes, this
is normally preceded by a manual profiling of the code). This is a standard technique in compiling generally, and
application to secondary processors is discussed in, for example, Athanas et al., "An Adaptive Hardware Machine
Architecture and Compiler for Dynamic Processor Reconfiguration", IEEE International Conference on Computer
Design, 1991, pages 397-400.

[0024] The approach taken here is to build directed acyclical graphs (DAGs) which represent the dataflows of selected
code. An advantageous way to do this is by using a compiler infrastructure appropriately configured for the extraction
of dataflows: an appropriate compiler infrastructure is SUIF, developed by the University of Stanford and documented
extensively at the World Wide Web site http://suif.stanford.edu/ and elsewhere. SUIF is devised for compiler research
for high-performance systems, specifically including systems comprising more than one processor. A standard SUIF
utility can be used to convert C code to SUIF. It is then a simple process for one skilled in the art to use SUIF tools to
build DAGs by performing a dataflow analysis over sections of SUIF and then recording the results of the analysis.
[0025] The extraction of DAGs from source code is a conventional step. The next step in the process, as can be seen
from Figure 2, is the conversion of these DAGs into trees. This step is a significant factor in making the optimal choice
of code for execution by the secondary processor 2. DAGs are complex structures, and difficult to analyse in an effective
manner. Reduction of DAGs to trees allows the aspects of the dataflows most important in determining their mapping
to hardware to be retained, while [...] significantly more effective.

[0026] Discussion of the reduction of DAGs to trees is made in "High Performance Compilers for Parallel Computing"
(as cited above), especially at pages 56 to 60. Different terminology is used here from [...] but equivalent and
comparable terms are indicated below. The trees constructed here are directly comparable to the "spanning trees"
referred to in the [...]

[0027] The preferred approach followed in the reduction of DAGs to trees is the removal of links not in the critical path
between leaf nodes and the root: this is illustrated in Figure 3. The critical path between nodes A and B is in a first
embodiment of this reduction process defined as the one that touches the [...] definition, acyclic, distinct paths can be
defined to meet this criterion. It is possible for there to be different paths between nodes that have the same maximum
number of nodes, but these paths are likely all to be satisfactory for the purpose of tree construction. While making an
arbitrary selection between these paths is a valid approach, a key issue in mapping the source code successfully is
scheduling, which depends on timing information: accordingly, where it is necessary to make a choice between
alternative "critical paths" it is desirable to choose the one that would take the longest time (in terms of time taken to
execute each of the operations represented by the nodes in the path). As is discussed further below, alternative
approaches can be adopted which are based more [...] also desirable to adopt a consistent approach in making such
choices [...] result from essentially similar DAGs.
[0028] The process taken in applying this [...] leaf node, every possible path towards the root is chased: as the DAG is a
directed graph, this is straightforward. As indicated above, for each leaf node the path with the greatest number of nodes
is chosen, and if two paths are found to have the same number of nodes, a selection is made. This is the critical path for
that leaf node. All other paths not selected are cut in their edge closest to the starting point. This cut edge is termed a
minor link (equivalent to the term "cross-link" in the Wolfe reference). The tree consists of the assembly of critical
paths, and [...] minor links are stored separately. Minor links will be required when [...] processor 2, but are not used in
determining which source code is to be mapped to the secondary processor.
[0029] It is of course possible to construct trees from DAGs without using the critical path criterion. Use of the critical
path does provide particular advantages. In particular, removal as minor links of the cross-links not in the critical path
will have little effect on scheduling, whereas if another approach was adopted removed cross-links [...] considerable
influence on timing and hence on scheduling. Use of [...] represents as best possible the critical features of the DAG in
the context of mapping to hardware.
[0030] Figure 3 shows the application of the process described in the preceding paragraph. Source code extract 11
shows three lines under consideration for execution by secondary processor 2. DAG 12 shows these three lines of code
represented as a directed acyclical graph, with root 126 (variable e) and leaf nodes 121, 129 and 130 as the inputs.
[0031] It is now a straightforward matter to assess each path from a given leaf node to the root, and to compare the
number of nodes in each path. From node 129 (integer value 2), there is only one path, through nodes 122, 123, 124
and 125: this is then the critical path from leaf node 129 to root node 126, and will be represented in the tree. From node
121 (in the present case the result of an earlier operation and designated c), there are two paths. The first path passes
through nodes 122, 123, 124 and 125, whereas the second path passes through nodes 127, 128 and 125. The first path
is the critical path, as it passes through more nodes: the second path can thus be cut as is discussed below. The
remaining leaf node 130 (variable b) also has two paths available: one passes through nodes 123, 124 and 125,
whereas the other passes through nodes 127, 128 and 125. These are equivalent in terms of number of nodes and so
either path can be chosen as the critical path: however, for reasons discussed above (timing and morphological
consistency) it is desirable to operate under an appropriate set of further rules to make the best selection. Such further
rules may, for example, be determined on the basis of the relevant hardware. Here, the second path is chosen.

[0032] The next step to take is to construct a tree 14 from the critical paths chosen from the DAG 12. This is done by
cutting all non-critical paths in their edge closest to the starting point (that is, the edge closest to the starting point which
is not also part of a critical path). The first non-critical path to consider is that from node 121 to root 126 through nodes
127, 128 and 125. This can be cut on the edge between nodes 121 and 127 - in the tree, this is represented by removal
of edge 151 between nodes 141 (corresponding to 121) and 147 (corresponding to 127), which is stored separately as
a minor link. The other non-critical path to consider is that from node 130 to root 126 through nodes 123, 124 and 125:
this can be cut on the edge between nodes 130 and 123. Again, this cut edge is stored as a minor link.
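
The reduction just described can be sketched in Python as follows. This is a minimal illustration under stated
assumptions, not the implementation used in the embodiment: the edge list is inferred from the paths named in the text
rather than taken from Figure 3 itself, the function names are hypothetical, and the tie-break rule (prefer the path that
brings the most not-yet-covered nodes into the tree) is an assumption chosen because it happens to reproduce the
selection made in the example.

```python
def all_paths(succ, node, root):
    """Enumerate every path from `node` to `root` by following directed edges."""
    if node == root:
        return [[root]]
    return [[node] + rest
            for nxt in succ.get(node, [])
            for rest in all_paths(succ, nxt, root)]


def reduce_to_tree(edges, leaves, root):
    """Return (tree_edges, minor_links) for a DAG whose edges point towards the root."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)

    covered, tree_edges, rejected = set(), set(), []
    for leaf in leaves:
        paths = all_paths(succ, leaf, root)
        # Critical path: the path with the most nodes. Tie-break (an assumption of
        # this sketch): prefer the path adding the most nodes not yet in the tree.
        best = max(paths, key=lambda p: (len(p), len(set(p) - covered)))
        covered.update(best)
        tree_edges.update(zip(best, best[1:]))
        rejected.extend(p for p in paths if p is not best)

    minor_links = set()
    for path in rejected:
        # Cut the edge closest to the leaf that is not part of any critical path.
        for edge in zip(path, path[1:]):
            if edge not in tree_edges:
                minor_links.add(edge)
                break
    return tree_edges, minor_links


# Edges of DAG 12 as implied by the paths listed in the text (dataflow direction).
FIG3_EDGES = [
    (129, 122), (121, 122), (122, 123), (123, 124), (124, 125), (125, 126),
    (121, 127), (127, 128), (128, 125), (130, 123), (130, 127),
]
tree_edges, minor_links = reduce_to_tree(FIG3_EDGES, leaves=[129, 121, 130], root=126)
print(sorted(minor_links))  # [(121, 127), (130, 123)]
```

The two minor links found are the edges 121-127 and 130-123, matching the cuts described above; the remaining nine
edges form the tree.
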
[0033] It should be noted that conditionals can be represented in DAGs and so reduced to trees in exactly the same
way as simple equations. An example is shown in Figure 8: this is a DAG representing the dataflow of the lines [...]
and shows a multiplexer node 186 and a "less than" operation node 186 in addition to the variable and integer nodes
181, 182, 183 and 184. As the skilled man will appreciate, it will generally be possible to use the approach shown here
for source code which can be represented as a DAG.
[0034] The tree structure that is left - in this case, tree 14 - is a much easier structure to [...] source code should be
mapped to secondary processor 2, as is discussed further below. The technique described above is a particularly
appropriate one for converting DAGs to trees, as it is straightforward to implement, is general in application, and through
use of the critical path maintains the maximum "depth" of the computational engine to be synthesised (assuming each
node represents a single computational element) because of the inclusion of paths with the maximum number of nodes.
As the person skilled in the art will appreciate, alternative approaches to determining which edges are to be removed in
converting the DAGs into trees can be adopted. One alternative embodiment of the DAG to tree reduction process is to
assign a timing-based weight to every node (based, for example, on the length of time required to execute the
corresponding computational element) and then to compare the accumulated weights of each path, selecting a path to
define the tree accordingly on the basis of, for example, greatest accumulated weight. This approach may be more
appropriate if the timing parameters of the secondary processor 2 will be a critical practical factor and in particular if the
timing dependencies are not mainly related to the node count (which may be the case in structures where, for example,
multiplication is several times more time consuming than addition).
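
The difference between the node-count criterion and the timing-weighted alternative can be illustrated with a small
Python sketch. The node names and timing weights below are hypothetical values chosen for the example; only the idea
that a multiplication may cost several times as much as an addition comes from the text.

```python
def heaviest_path(paths, weight):
    """Pick the path whose nodes have the greatest accumulated weight."""
    return max(paths, key=lambda path: sum(weight[node] for node in path))


# Hypothetical timings in arbitrary units: multiplication is assumed to take
# four times as long as addition; variables carry no execution time.
weight = {"b": 0, "add1": 1, "add2": 1, "add3": 1, "mul": 4, "e": 0}
paths = [
    ["b", "add1", "add2", "add3", "e"],  # more nodes, accumulated weight 3
    ["b", "mul", "e"],                   # fewer nodes, accumulated weight 4
]
longest_by_count = max(paths, key=len)
longest_by_time = heaviest_path(paths, weight)
print(longest_by_count)  # ['b', 'add1', 'add2', 'add3', 'e']
print(longest_by_time)   # ['b', 'mul', 'e']
```

With these assumed weights the two criteria select different paths, which is the situation in which the timing-based
variant would be expected to matter.
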
[0035] The next step in the compilation process, as can be seen from Figure 2, takes trees as inputs and [...]
the selection of source code for the secondary processor 2. As is further illustrated in Figure 2, this step of the process
comprises a series of sub-steps. The first of these is the analysis and classification of the trees resulting from the
candidate dataflows. This is a significant original step, and is discussed in detail below.
[0036] The objective in this stage of [...] dataflows from the source code would be the best choices for execution by the
secondary processor. This is to a large degree dependent on the nature of the hardware in the secondary processor. An
extremely efficient mapping of source code to the secondary processor 2 can be made where dataflows are sufficiently
similar that broadly the same hardware representation can be used for each dataflow. It therefore follows that good
choices of candidate dataflows for mapping to the secondary processor can be made by finding sets of dataflows that are
sufficiently similar to each other. This is what is achieved by analysing and classifying the trees resulting from the
candidate dataflows.
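
One simple way to picture "finding sets of dataflows that are sufficiently similar" is threshold grouping over pairwise
dissimilarities, sketched below in Python. This is not the classification procedure of the embodiment, which is
described in the following paragraphs; the grouping rule (two trees share a group if a chain of pairs, each within the
threshold, links them), the tree names and the distance values are all assumptions made for illustration.

```python
def group_by_distance(names, distance, threshold):
    """Group items linked by a chain of pairwise distances at or below `threshold`."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent pointers to the group representative, shortening the chain.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), d in distance.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Hypothetical pairwise dissimilarities between four candidate trees.
names = ["T1", "T2", "T3", "T4"]
distance = {
    ("T1", "T2"): 4, ("T1", "T3"): 300, ("T1", "T4"): 310,
    ("T2", "T3"): 305, ("T2", "T4"): 320, ("T3", "T4"): 6,
}
print(group_by_distance(names, distance, threshold=10))
# [['T1', 'T2'], ['T3', 'T4']]
```
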
[0037] A powerful technique for matching trees, used in this embodiment of the invention, is the tree matching
algorithm devised by Kaizhong Zhang of the University of Western Ontario, Canada.

[0038] This algorithm is described in Kaizhong Zhang, "A Constrained Edit Distance Between Unordered Labelled
Trees", Algorithmica (1996) 15:205-222, Springer Verlag, and is provided as a toolkit by the University of Western
Ontario, the toolkit being at the time of writing obtainable over the internet from ftp://ftp.csd.uwo.ca/pub/[...].
It will be appreciated that alternative approaches of matching trees to determine a degree of similarity therebetween are
available to the skilled man. The approach to tree matching used in this embodiment of the invention is described below.

[0039] The principle of operation of Zhang's algorithm is the following: two trees are compared node-by-node through
a dynamic programming technique that minimises the edit operations required to transform one tree into another. This
cost of transformation is termed here an edit cost. The edit costs of successively larger subtrees are cross-compared,
with a record being kept of the minimum costs found. The computational structure can be characterised as that of a
recursive dynamic program which uses a working dynamic programming grid to calculate component subtree distances
and records the result on the main grid.

[0040] The edit operations available are insertion, deletion and substitution. These are shown in Figures 4a and 4b.
Figure 4a shows two trees: tree 151 with five nodes and tree 152 with six nodes. The structure of the trees can be made
identical by addition of a node between nodes 3 and 5 of tree 151: this new node gives the structure of tree 152.
Consequently transformation of tree 151 to tree 152 is achieved by insertion of this node, and transformation of tree 152
to tree 151 is achieved by deletion of it (in the CHESS architecture described in Appendix A, "deletion" is represented in
hardware by "bypass" of a unit of the array: this is an example of an architecturally designed cost - in this case, an
extremely low cost). For Figure 4b, the two trees 151 and 152 have the same structure, but the two nodes 4 represent
a different type of operation in each tree: it is therefore necessary to substitute for node 4 in transforming one tree to
the other. Every node therefore needs a "label": a tag attached to the node which identifies the type of node among the
various types of node possible.

[0041] As previously indicated, each of [...] for example, the same result may be achieved in some architectures either
by an insertion and a deletion, or by a substitution: the costs of these different alternatives can be compared.

[0042] The result of the comparison of two trees by this algorithm is the production of a list of pairs of nodes (t1, t2),
where t1 belongs to the first tree and t2 belongs to the second tree. Each [...]
points in the two trees, suggesting the mapping of t1 and t2 on to each other. The list of pairs effectively defines the
skeleton of a tree which can contain either of the compared trees: in this skeleton, to transform the first tree into the
second tree, each node t1 has to be substituted with the respective t2. Nodes that do not occur in the mapping must be
either inserted or deleted depending on which tree they belong to, as is discussed further below. For this list of pairs
there will be defined an edit distance [...]
one tree to the other. The algorithm is devised to determine an edit distance between two trees, together [...]
transformations which achieves that edit distance: alternative transformations will be possible, but they will have a
higher associated cumulative edit cost.
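
The way a list of node pairs determines the edits can be illustrated with a minimal sketch. The node identifiers, labels and unit costs below are hypothetical and are not taken from the specification; in the method described, the pairs would come from Zhang's algorithm and the costs from the secondary processor architecture.

```python
def edit_script(labels1, labels2, pairs, sub_cost, ins_cost, del_cost):
    """Derive edit operations from a list of mapped node pairs (t1, t2).

    labels1 / labels2 map node ids to operation labels for each tree.
    The cost functions are supplied by the caller (hypothetical values here).
    """
    mapped1 = {t1 for t1, _ in pairs}
    mapped2 = {t2 for _, t2 in pairs}
    ops = []
    # Mapped nodes whose labels differ must be substituted.
    for t1, t2 in pairs:
        a, b = labels1[t1], labels2[t2]
        if a != b:
            ops.append(("substitute", t1, t2, sub_cost(a, b)))
    # Unmapped nodes of the first tree are deleted.
    for n in sorted(labels1):
        if n not in mapped1:
            ops.append(("delete", n, None, del_cost(labels1[n])))
    # Unmapped nodes of the second tree are inserted.
    for n in sorted(labels2):
        if n not in mapped2:
            ops.append(("insert", None, n, ins_cost(labels2[n])))
    total = sum(op[3] for op in ops)
    return ops, total


# Hypothetical example: one label differs and the second tree has one extra node.
first = {1: "add", 2: "mul", 3: "load"}
second = {1: "sub", 2: "mul", 3: "load", 4: "shift"}
pairs = [(1, 1), (2, 2), (3, 3)]
ops, total = edit_script(first, second, pairs,
                         sub_cost=lambda a, b: 2,
                         ins_cost=lambda label: 1,
                         del_cost=lambda label: 1)
print(ops, total)
```

The sketch only totals the cost of a given mapping; finding the mapping with the minimum cumulative cost is the task of the tree matching algorithm itself.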

[0043] The value of computing an edit distance based on edit costs is that [...]
the "hardware cost" in reconfiguring the secondary processor from the configuration representing one tree to a
configuration representing the other tree [...] secondary processor resources that will be taken up to achieve the second
configuration given the existence of the first - this can be considered, for example, in terms of the additional area of
device used. These costs will be determined by the nature of the secondary processor hardware, as for different types of
hardware the physical realisation of insertion, deletion and substitution operations will be different. For the
reconfigurable CHESS array discussed in Appendix A, a "bypass" operation involves minimal cost, a substitution between
adds and subs (addition and subtraction operations) has low cost, whereas substitution between muls and divs
(multiplication and division operations) is expensive.
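
An architecture-dependent cost model of this kind could be sketched as follows. All numeric values are hypothetical: only their ordering (bypass cheapest, add/sub substitution cheap, mul/div substitution expensive) follows the text, and the names used are assumptions made for illustration.

```python
# Hypothetical relative costs; only the ordering reflects the description.
BYPASS_COST = 1                   # "deletion" realised as bypass of a unit
INSERT_COST = 5                   # assumed cost of bringing a new unit into use
DEFAULT_SUBSTITUTION_COST = 10    # assumed cost for pairs not listed below
SUBSTITUTION_COSTS = {
    frozenset({"add", "sub"}): 2,    # related operations: low cost
    frozenset({"mul", "div"}): 20,   # expensive substitution
}


def substitution_cost(op_a, op_b):
    """Cost of reconfiguring a unit from op_a to op_b (symmetric)."""
    if op_a == op_b:
        return 0
    return SUBSTITUTION_COSTS.get(frozenset({op_a, op_b}), DEFAULT_SUBSTITUTION_COST)


def cheapest_change(op_a, op_b):
    """Compare a substitution with the alternative of a deletion plus an insertion."""
    candidates = [
        ("substitute", substitution_cost(op_a, op_b)),
        ("delete+insert", BYPASS_COST + INSERT_COST),
    ]
    return min(candidates, key=lambda c: c[1])


print(cheapest_change("add", "sub"))   # substitution is cheaper under these costs
print(cheapest_change("mul", "div"))   # deletion plus insertion is cheaper here
```

With a different secondary processor the table would change, and with it the edit distances computed between the same pair of trees.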

[0044] As indicated above, an edit distance [...]
taken: using Zhang's algorithm, or a comparable approach, a taxonomy can be built to show the edit distances between
each one of a set of trees. This taxonomy can readily be provided in the form of a tree, of which an example is shown
in Figure 5. Each leaf node 161 of the tree represents a candidate tree extracted from a DAG, and each intermediate
node 162 represents an edit cost. The tree provides a unique path between each pair of leaf nodes. The edit distance
between the two leaf nodes of a pair is found by summation of costs [...]
example, the edit distance between any pair of the leaf nodes representing Tree#4, Tree#5 or Tree#6 is 6. However, the
edit distance between Tree#1 and Tree#4 is 496: the summation of intermediate nodes with values of 12, 221, 107, 50
and 6.
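
The path summation can be sketched as below. The taxonomy layout and the cost values are hypothetical (Figure 5 is not reproduced here), and the representation as a parent dictionary is an assumption made for illustration.

```python
def ancestors(node, parent):
    """Intermediate nodes from a node's parent up to the root, nearest first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def taxonomy_distance(a, b, parent, cost):
    """Sum the costs of the intermediate nodes on the unique path between two leaves."""
    if a == b:
        return 0
    up_a = ancestors(a, parent)
    up_b = ancestors(b, parent)
    common = set(up_b)
    # The first ancestor of a that is also an ancestor of b is where the paths meet.
    meet = next(n for n in up_a if n in common)
    path = up_a[:up_a.index(meet) + 1] + up_b[:up_b.index(meet)]
    return sum(cost[n] for n in path)


# Hypothetical taxonomy: n3 is the root; leaves are T1..T4.
parent = {"T1": "n3", "n2": "n3", "T2": "n2", "n1": "n2", "T3": "n1", "T4": "n1"}
cost = {"n1": 6, "n2": 50, "n3": 12}

print(taxonomy_distance("T3", "T4", parent, cost))   # only n1 lies on the path
print(taxonomy_distance("T1", "T4", parent, cost))   # n3, n2 and n1 lie on the path
```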

[0045] This taxonomy is indicative of the number of edit operations required to translate between trees. Such a
taxonomy is a valuable tool, as it can be used heuristically as a metric for the degree of variation between candidate trees.
The creation of a taxonomy thus renders it easy to determine which trees are sufficiently similar to be consolidated
together (as will be discussed below), and which are too diverse for this purpose. This can be done by imposition of an
edit distance threshold. A group of trees can be selected for consolidation if the edit distance between each and every
possible pair of trees in the group is less than the edit distance threshold. The value of the edit distance threshold is
arbitrary, and can be chosen by the person skilled in the art in the context of specific primary and secondary processors
in order to optimise the performance of the system.
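
The pairwise threshold condition can be expressed directly, as in the sketch below. The greedy grouping strategy shown is an assumption made for illustration, since the text states only the condition a group must satisfy and not how groups are to be formed; the distances are likewise hypothetical.

```python
from itertools import combinations


def can_consolidate(group, dist, threshold):
    """True if every possible pair of trees in the group is closer than the threshold."""
    return all(dist(a, b) < threshold for a, b in combinations(group, 2))


def greedy_groups(trees, dist, threshold):
    """Place each tree in the first existing group it is compatible with (assumed strategy)."""
    groups = []
    for t in trees:
        for g in groups:
            if can_consolidate(g + [t], dist, threshold):
                g.append(t)
                break
        else:
            groups.append([t])
    return groups


# Hypothetical symmetric edit distances between four candidate trees.
D = {
    frozenset({"T4", "T5"}): 6, frozenset({"T4", "T6"}): 6, frozenset({"T5", "T6"}): 6,
    frozenset({"T1", "T4"}): 400, frozenset({"T1", "T5"}): 400, frozenset({"T1", "T6"}): 400,
}
dist = lambda a, b: D[frozenset({a, b})]

groups = greedy_groups(["T1", "T4", "T5", "T6"], dist, threshold=10)
print(groups)
```

Raising the threshold admits more trees to each group at the price of a larger worst-case reconfiguration between members.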

[0046] The advantage of consolidating a group of trees is that a common hardware configuration can be used for the
whole group and will support the function of each tree. This is particularly appropriate for architectures, such as
CHESS, in which low-latency partial reconfiguration mechanisms are available on the secondary processor.
Reconfiguration is required to change the configuration from that to support the function of one tree to that to support the
function of another tree: however, as the edit distance between these trees will never be greater than the edit cost threshold, the
degree of reconfiguration required is already known to be within acceptable bounds. The group of trees are
consolidated together by construction of a "supertree" which contains a representation of every component tree. After it has
been constructed, the supertree can be converted into a representation of each of the relevant DAGs extracted from the
source code by reinsertion of the previously removed minor links. The hardware configuration may then be determined
from the full supertree. The construction of the supertree is discussed in detail below.

[0047] Figure 6 illustrates the step of construction of a supertree from a group of trees which fell below the specified
edit cost threshold: such a group of trees is here termed a class. The trees 171, 172 and 173 can all be mapped
together into supertree 170. The reconfiguration required to change the hardware configuration from that to support, for
example, tree 171 to that of tree 172 is sufficiently limited to be realizable in practice, because the edit distance between
the two trees is below the edit cost threshold.

[0048] An exemplary supertree assembly algorithm, merge, is provided as C code in Appendix B. The function of the 
algorithm is described below, with reference to Figure 9. The algorithm contains the following elements:

merge:

[0049] The tree in the class with the largest number of nodes is chosen to be the initial merge tree - if there are trees
with an equal number of nodes, an arbitrary selection can be made. The remaining trees are termed source trees.
[0050] For each source tree the following operations are then applied:

From the mapping between the source tree and the merge tree which has been calculated (in this embodiment,
from Zhang's algorithm and edit costs determined from the secondary processor architecture), the supertree is
constructed as follows:

1. Firstly, mapped nodes closest to the root are considered;

2. The source tree operation (source operation) is concatenated to the corresponding mapped merge tree
operation (merge operation);

3. For each child operation of the source operation:

   a. If the child is mapped, revert to step 2 with respect to the source child.

   b. If the child is not mapped, then consider whether there is any mapping [...]
   is the root (source subtree).

      i. If there is no further mapping, simply adopt the source subtree, merging into the merge tree under
      the corresponding merge tree node.

      ii. If there is a further mapping inside the source subtree, connect the subtree as follows:

         a. If the merge operation of this subordinate mapping falls outside the previously mapped subtree,
         remove the mapped source operation from the source tree. There is recursion present at this
         stage - where mapped children have already been dealt with, all that needs to be done is to
         remove what would otherwise be a cross tree link.

         b. This is shown in Figure 9. If the merge operation of this subordinate mapping does fall within
         the previously mapped subtree, climb up the merge tree until the least common ancestor for all
         contained subordinate mappings is found. The least common ancestor [...]
         all of the source mappings. The unmapped source segment is then mapped into the merge tree
         by linking the source operation of the unmapped source subtree as a child of the least common
         ancestor's parent and by linking the least common ancestor as the child of the unmapped
         operation just above the closest mapped source operation in the current subtree (where the
         "closest mapped source operation" delimits the lower end of an unmapped segment of the source tree,
         and is a mapped node which falls within the subtree of the current mapping - the source node's
         parent, which is unmapped, adopts the merge tree's least common ancestor as a child and vice
         versa).

The pair of intermingled trees are normalised into a single tree, which [...]

The procedure continues until all the source trees in the class are contained within the merge tree, which is now a
supertree.






[0051] This process is indicated in Figure 9. Figure 9a shows two dataflow trees, a merge tree 201 and a source tree
202. There are three mappings between nodes made by the comparison algorithm - the remaining nodes need
to be inserted appropriately. As indicated in section 1 above, the first step is to consider the mapped operations nearest
the root - in this case, at the root. These operations A are concatenated.

[0052] After this, the child nodes of A in the source tree are considered. Node B does not have a mapping and is not
an ancestor to any mappings - it is therefore merged as a child of A:A (see Figure 9b). The other child node of A, C,
does however have descendant mappings (D and F which map to D and E in the merge tree). Both the relevant merge
operations fall in the previously mapped subtree (as they are both descendants of A). It is therefore necessary to follow
the course set out in section 3(b)(ii)(b) above. The least common ancestor containing both mapped merge operations
D and E is X. C of the source tree is thus linked into the merge tree as child of A:A (the parent of X) and parent of X.
This arrangement is shown in Figure 9b - the merging is completed by concatenation or merging of the remaining nodes
of the source tree, all of which steps are straightforward.
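
The least-common-ancestor step of this example can be sketched as follows. This is not the merge routine of Appendix B: the `Node` class and both helper functions are hypothetical, the shape of the merge tree is assumed from the description (A above X, X above D and E), and the splice assumes the least common ancestor is not the root.

```python
class Node:
    """Minimal tree node with parent and child links (hypothetical helper)."""

    def __init__(self, label, children=None):
        self.label = label
        self.parent = None
        self.children = []
        for child in children or []:
            self.add(child)

    def add(self, child):
        child.parent = self
        self.children.append(child)


def path_to_root(node):
    """The node itself followed by its ancestors, nearest first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def least_common_ancestor(nodes):
    """Deepest node lying on the root path of every node in `nodes`."""
    common = path_to_root(nodes[0])
    for n in nodes[1:]:
        on_path = {id(p) for p in path_to_root(n)}
        common = [c for c in common if id(c) in on_path]
    return common[0]


def splice_above(lca, new_node):
    """Make new_node a child of lca's parent and the parent of lca (lca must not be the root)."""
    parent = lca.parent
    parent.children[parent.children.index(lca)] = new_node
    new_node.parent = parent
    new_node.add(lca)


# Merge tree assumed from the description: A -> X -> (D, E).
D, E = Node("D"), Node("E")
X = Node("X", [D, E])
A = Node("A:A", [X])

A.add(Node("B"))                      # B has no mappings below it: adopted under A:A
C = Node("C")                         # unmapped source node with mappings below it
lca = least_common_ancestor([D, E])   # mapped merge operations under C
splice_above(lca, C)

print([c.label for c in A.children], [c.label for c in C.children])
```

After the splice, C sits between A:A and X, which is the arrangement described for Figure 9b.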

[0053] The resultant supertree 203 is shown in Figure 9c. This supertree 203 acts as merge tree for the merging in
of a further candidate source tree 204, as shown in Figure 9d. In this case each node of the source tree is mapped into
a supertree node - merging is thus entirely straightforward, and consists only of concatenation (ie substitution). This
process continues until all the candidate trees are merged into a supertree.
[0054] At this stage [...]
secondary processor. The source code will contain DAGs other than those which [...]
tree: for example, DAGs which have not been considered because they do not lie at one of the most computationally
intensive "hot spots" of the code. However, the code of these DAGs may also run more quickly if executed on an
appropriately adapted secondary processor rather than on the primary [...]
remaining DAGs with the supertree by a backmapping process. Processes derived from conventional backmapping
techniques, such as iburg, can be utilised for this purpose. However, the most advantageous approach may be to return
to use of Zhang's algorithm, and match further candidate trees in the source code against the supertree, but this time
with a lowered edit cost threshold [...]
tree, or where the edit cost for such a mapping falls below some minimum level, then the code of these DAGs can also
be allocated to the secondary processor and the supertree modified, if necessary. Control information related to any
such dataflows added by this backmapping [...]

[0055] From this supertree, it is then straightforward to insert the minor links which were removed from the DAGs on
their conversion into trees (including here any DAGs added from the backmapping process, if employed). The resulting
structure is a class dataflow, which represents all the information present in the DAGs of the class: control information
for the supertree (for example, to determine any reconfiguration that is to occur) must also be present. This class
dataflow can be used for the purpose of determining the hardware configuration of the secondary processor, and can also
be used to provide a structure for enabling stitching back into the source code appropriate calls to the secondary
processor: these steps are described further below.

[0056] Stitching calls to the secondary processor back into the source code in fact requires only the supertree, and
not the class dataflow, as the supertree prescribes the periphery of the dataflow. The actions required with respect to
any replaced dataflow in the source code are replacement of inputs of the dataflow (leaves of the tree reduced from that
dataflow) with load primitives and of the output of the dataflow (root of the relevant tree) with a read. The leaves and
roots of the relevant tree are contained in the supertree, so only the supertree is required for the purpose. All remaining
code subsumed in the dataflow [...]
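
The replacement of leaves by load primitives and of the root by a read might be sketched as follows. The textual form of the emitted calls (`load(...)`, `read(...)`), the register names and the function name are all hypothetical; the specification does not give the syntax of these primitives.

```python
def stitch_call(leaves, root_var, leaf_regs, root_reg):
    """Emit replacement source lines for one dataflow (hypothetical call syntax).

    leaves    : names of the input variables (leaves of the tree), in order
    root_var  : name of the variable receiving the dataflow's result (root of the tree)
    leaf_regs : I/O resource allocated to each leaf
    root_reg  : I/O resource allocated to the root
    """
    lines = [f"load({leaf_regs[v]}, {v});" for v in leaves]
    lines.append(f"{root_var} = read({root_reg});")
    return lines


# Hypothetical dataflow with three inputs and one output.
lines = stitch_call(["a", "b", "c"], "y", {"a": "R0", "b": "R1", "c": "R2"}, "R3")
print("\n".join(lines))
```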

[0057] Figure 7 shows a logical interface for achieving the necessary substitutions into the source code. An input tree,
labelled Input Tree #3, is shown, together with a supertree, labelled PFU Tree. Each node in Input Tree #3 has its own
unique operation ID obtained from the compiler internal form representation. For the supertree (PFU Tree), registers or
other I/O resources are allocated to the leaves and the root. The implicit mapping between Input Tree #3 and PFU Tree
thus provides a correspondence between operation IDs of the Input Tree nodes and the I/O resources allocated for PFU
Tree in the form of a specification. The application of [...]
removal of the code subsumed by the PFU and the substitution of the necessary [...]

[0058] From the class dataflow, it is possible to configure the secondary processor. This step can be conducted
according to known approaches, by reduction of the class dataflow to a netlist (with insert, delete and substitute
operations, and including in appropriate form any dynamic reconfiguration instructions), and then mapping the netlist to the
specific secondary processor hardware, taking into account requirements of reconfiguration between component
dataflows. For conventional FPGA architectures, these steps can be carried out essentially by use of appropriate known
tools. For example, in the case of a standard Xilinx FPGA such as the XC4013, appropriate Xilinx proprietary tools
can be used. Firstly, the netlist can be rendered in Xilinx netlist format (XNF). This can then be followed by partitioning
into configurable logic blocks and input/output blocks by the Xilinx Partition Place and Route program (PPR), with the
resultant being converted to a configuration bitstream by the Xilinx MakeBits program. This approach is discussed,
together with further discussion of provision of predetermined reconfiguration solutions, in "Run-Time Programming
Method for Reconfigurable Computer" by Steve Casselman, currently available on the World Wide Web at
http://wvw.rec[...], a contribution to the World Wide Web roundtable
on reconfigurable computing operated by SB Associates, Inc. of 504 Nino Avenue, Los Gatos, CA 95032, USA.
Essentially similar procedures can be followed for alternative types of configurable and reconfigurable processor such
as the CHESS device described in Appendix A, using tools appropriate to the processor concerned.

[0059] Once the source code is generated in executable form with appropriate calls to the secondary processor, and
once the secondary processor configuration has been determined, the source code can be loaded and executed. The
source code is executed in the primary processor with calls to coprocessors and the secondary processor: as the
secondary processor is specifically adapted to process the dataflows extracted to it, the execution speed of the code is
significantly increased. For example, a 25% improvement was found in application of the method of this embodiment of the
invention to the iDCT algorithm from the JPEG toolkit even though this is in fact a poor problem for mapping to such a
secondary processor because of I/O constraints.

[0060] The methods here described are thus particularly effective to allow for optimal use of the secondary processor
in an architecture comprising a primary processor and a reconfigurable secondary processor.

APPENDIX A
CHESS array

The CHESS array is a variety of field programmable array in which the programmable
elements are not gates, as in an FPGA, but 4-bit arithmetic logic units (ALUs). The array
configuration is described in detail in European Patent Application No 97300563.0, and the
ALU structure and provision of instruction to ALUs is discussed in a copending application
entitled "Reconfigurable Processor Devices" and filed on the same date as the present
application.

The CHESS array consists of a chessboard layout with alternating squares comprising an ALU
and a switchbox structure respectively. The configuration memory for an adjacent switchbox
is held in the ALU. Individual ALUs may be used in a processing pipeline, and in a preferred
implementation, provision is made to allow dynamic provision of instructions from one ALU
to determine the function of a succeeding ALU. ALUs are 4-bit, with four identical bitslices,
with 4-bit inputs A and B taken directly from an extensive 4-bit interconnect wiring network,
and 4-bit output U provided to the wiring network through an optionally latchable output
register: 1-bit carry input and output are also provided and have their own interconnect.

Dynamic instructions are providable from the output U of one ALU to a 4-bit instruction input
I of another ALU. The carry output Cout of one ALU can also be used as Cin of another ALU
with the effect of changing the instruction of that ALU.



The CHESS ALU is adapted to support multiplexing between A and B inputs, and also
supports multiplexing between related instructions (eg OR/NOR, AND/NAND).
Reconfiguration between such instructions can be achieved through appropriate use of the
carry inputs and outputs without consumption of silicon. More complex reconfigurations (eg
AND/XOR, Add/Sub) can be achieved through using two ALUs, the first to multiplex between
the two alternative instructions and the second to execute the chosen instruction on the
operands. Multiplication will take up more than a single ALU, making reconfiguration
involving a multiplication operation more complex. It is straightforward using the multiplexer
capacity of a CHESS ALU to "bypass" an operation, with appropriate control resulting in
either performance of the operation or propagation of a given input.

A sample set of functions obtainable from the instruction inputs is indicated in Table A1
below: a wide range of possibilities are available with appropriate logic in connection of the
instruction inputs to the ALU. The functions are described in Table A2.

Table A1: Instruction bits and corresponding functions

| Name          | U function           | Cout function                                                           |
|---------------|----------------------|-------------------------------------------------------------------------|
| ADD           | A plus B             | Arithmetic carry                                                        |
| SUB           | A minus B            | Arithmetic carry                                                        |
| A AND B       | Uj = Aj AND Bj       |                                                                         |
| A OR B        | Uj = Aj OR Bj        |                                                                         |
| A NOR B       | Uj = NOT (Aj OR Bj)  |                                                                         |
| A XOR B       | Uj = Aj XOR Bj       | Cout = Cin                                                              |
| A NXOR B      | Uj = NOT (Aj XOR Bj) |                                                                         |
| A AND (NOT B) | Uj = Aj AND (NOT Bj) |                                                                         |
| (NOT A) AND B | Uj = (NOT Aj) AND Bj | Cout = Cin                                                              |
| [...]         | Uj = (NOT Aj) OR Bj  | Cout = Cin                                                              |
| A OR (NOT B)  | Uj = Aj OR (NOT Bj)  | Cout = Cin                                                              |
| [...]         | Uj = NOT Aj          | Cout = Cin                                                              |
| [...]         | Uj = NOT Bj          | Cout = Cin                                                              |
| A EQUALS B    | Not applicable       | if A == B then 0, else 1                                                |
| MATCH1        | Not applicable       | bitwise AND of A and B, followed by OR across the width of the word     |
| MATCH0        | Not applicable       | bitwise OR of A and B, followed by an AND across the width of the word  |


Table A2: Outputs for instructions 
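
The legible rows of the table can be modelled in a few lines, as a minimal behavioural sketch of a 4-bit ALU. The dictionary keys and function names are assumptions chosen for illustration, and the carry outputs of ADD and SUB are deliberately omitted because the sketch does not assume a particular carry convention.

```python
WIDTH = 4                    # CHESS ALUs are 4 bits wide
MASK = (1 << WIDTH) - 1      # 0b1111

# U output for a subset of the functions in the table (names are illustrative).
U_FUNCTIONS = {
    "ADD":  lambda a, b: (a + b) & MASK,
    "SUB":  lambda a, b: (a - b) & MASK,
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NOR":  lambda a, b: ~(a | b) & MASK,
    "XOR":  lambda a, b: a ^ b,
    "NXOR": lambda a, b: ~(a ^ b) & MASK,
}


def equals_cout(a, b):
    """A EQUALS B: carry output is 0 when the inputs are equal, else 1."""
    return 0 if a == b else 1


def match1_cout(a, b):
    """MATCH1: bitwise AND of A and B, then OR across the width of the word."""
    return 1 if (a & b) & MASK else 0


def match0_cout(a, b):
    """MATCH0: bitwise OR of A and B, then AND across the width of the word."""
    return 1 if ((a | b) & MASK) == MASK else 0


print(U_FUNCTIONS["ADD"](0b1001, 0b1000), U_FUNCTIONS["NOR"](0b1010, 0b0100))
print(match1_cout(0b1010, 0b0010), match0_cout(0b1010, 0b0001))
```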



[...] complement arithmetic is used, and the arithmetic carry is provided to be consistent with [...]
arithmetic. The MATCH functions are so-called because for MATCH1 the value of 1 is [...]

1. A method of compiling source code to a primary [...]

selective extraction of dataflows from the source code;

transformation of the extracted dataflows into trees;

matching of the trees against each other to determine minimum [...]
tree into another;

determining a group or a plurality of groups of dataflows on [...]
and creating for each group a generic dataflow capable of supporting [...]

using the generic dataflow or dataflows to determine [...]

and

substituting into the source code calls to the secondary processor [...]
dataflows, and compiling the resultant source code to the primary processor.

2. A method as claimed in claim 1, wherein said minimum edit cost relationships are embodied in a taxonomy of
minimum edit distances for classification of the trees.

3. A method as claimed in claim 1 or claim 2, wherein said minimum edit cost relationships are determined according
to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration
of the secondary processor.

4. A method as claimed in any of claims 1 to 3, wherein the hardware configuration of the secondary [...]
for reconfiguration of the secondary processor during execution of the source code.

5. A method as claimed in claim 4, wherein the secondary processor is an application specific instruction processor.

6. A method as claimed in claim 4, wherein the secondary processor is a field programmable gate array.

7. A method as claimed in claim 4, wherein the secondary processor is a field programmable arithmetic array.

8. A method as claimed in any of claims 4 to 7, wherein reconfiguration of the secondary processor is required during
execution of the source code to support each dataflow in the group supported by a generic dataflow.

9. A method as claimed in any preceding claim, wherein a generic dataflow of a group [...]
mapping of dataflows in the group on to each other, followed by a merge operation [...]

10. A method as claimed in any preceding claim, wherein the dataflows are provided as directed acyclical graphs and
are reduced to trees by removal of any links in the directed acyclical graphs not present in a critical path between
a leaf node and the root of a directed acyclical graph.

11. A method as claimed in claim 10, wherein the critical path is a path between two nodes which passes through the
largest number of intermediate nodes.

12. A method as claimed in claim 10, wherein the critical path is a path between two nodes with the greatest
accumulated execution time.

13. A method as claimed in any of claims 10 to 12, wherein after the creation of a generic dataflow, the generic dataflow
is compared with further dataflows extracted from the source code and provided in the manner defined in claim 10,
wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the
generic dataflow.

14. A method as claimed in any of claims 10 or claim 13 where dependent on claim 9, wherein the removed links are
stored after the directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the
merging of the trees of the group into the generic dataflow.

Figure 4a

Figure 4b

Figure 9d: Next Candidate Mapping
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